Scott Cole

8 May 2017

This notebook covers how to use Python to analyze the relationship between multiple input variables and one output variable using multiple linear regression.

Outline:

  1. Regular (single-variable) linear regression
  2. Multiple linear regression
  3. Regressing out one variable from another

Import libraries

We will use the statsmodels library to run our linear regressions.


In [1]:
import numpy as np
import statsmodels.formula.api as smf
import pandas as pd
import scipy as sp

%config InlineBackend.figure_format = 'retina'
%matplotlib inline
import matplotlib.pyplot as plt

1. Regular linear regression

1a. Fitting a line

$y_i = \beta_0+\beta_1x_i+\epsilon_i$


In [2]:
# Define true statistics relating x and y
N_points = 10
true_beta0 = 0
true_beta1 = 2
noise_stdev = 1

# Set random seed
np.random.seed(0)

# Generate correlated data
x = np.random.randn(N_points) + 2
y = true_beta0 + true_beta1*x + np.random.randn(N_points)*noise_stdev
print('x=', x)
print('y=', y)


x= [ 3.76405235  2.40015721  2.97873798  4.2408932   3.86755799  1.02272212
  2.95008842  1.84864279  1.89678115  2.4105985 ]
y= [ 7.67214826  6.25458792  6.71851369  8.60346141  8.17897921  2.37911857
  7.39425591  3.49212732  4.10663     3.96710126]

In [3]:
# Plot x and y
plt.figure(figsize=(4,4))
plt.plot(x, y, 'k.', ms=12)
plt.xlabel('x',size=15)
plt.ylabel('y',size=15)


Out[3]:
<matplotlib.text.Text at 0x1145cfda0>

In [4]:
# Fit line to data
A = np.vstack([x, np.ones(len(x))]).T
m, b = np.linalg.lstsq(A, y)[0]
print('True statistics: y =', true_beta1, '*x +', true_beta0)
print('Estimated stats: y =', m, '*x +', b)
print('Correlation coefficient (r) =', np.round(sp.stats.pearsonr(x,y)[0],2))


True statistics: y = 2 *x + 0
Estimated stats: y = 2.04994401661 *x + 0.26389814044
Correlation coefficient (r) = 0.95

In [5]:
# Plot fitted line
plt.figure(figsize=(4,4))
plt.plot(x, y, 'k.', ms=12)
plt.plot([0,5], [true_beta1*x+true_beta0 for x in [0,5]], 'k--',label='True correlation')
plt.plot([0,5], [m*x+b for x in [0,5]], 'r--',label='Estimated correlation')
plt.xlabel('x',size=15)
plt.ylabel('y',size=15)
plt.xlim((0,5))
plt.legend(loc='best')


Out[5]:
<matplotlib.legend.Legend at 0x114854fd0>

1b. Outliers and normality of errors

Minimizing the squared error of the line fit works well when the errors are normally distributed.

In this section, we simulate data in which the errors are not normally distributed, and see that the fitted line is less accurate.


In [6]:
# Simulate data with non-normal distribution of error
np.random.seed(1)
N_points = 100
x = np.random.randn(N_points) + 2
y = true_beta0 + true_beta1*x + np.random.randn(N_points)**2  # squared-Gaussian noise: skewed, with occasional large positive outliers

In [7]:
# Fit line to data
A = np.vstack([x, np.ones(len(x))]).T
m, b = np.linalg.lstsq(A, y)[0]
print('True statistics: y =', true_beta1, '*x +', true_beta0)
print('Estimated stats: y =', m, '*x +', b)
print('Correlation coefficient (r) =', np.round(sp.stats.pearsonr(x,y)[0],2))


True statistics: y = 2 *x + 0
Estimated stats: y = 2.04202383003 *x + 0.805367000302
Correlation coefficient (r) = 0.82

In [8]:
# Plot fitted line
plt.figure(figsize=(4,4))
plt.plot(x, y, 'k.', ms=8)
plt.plot([0,5], [true_beta1*x+true_beta0 for x in [0,5]], 'k--',label='True correlation')
plt.plot([0,5], [m*x+b for x in [0,5]], 'r--', label='Estimated correlation')
plt.xlabel('x',size=15)
plt.ylabel('y',size=15)
plt.xlim((0,5))
plt.legend(loc='best')


Out[8]:
<matplotlib.legend.Legend at 0x114a4ac88>

In [9]:
plt.figure(figsize=(8,3))
errors = y - [m*xi+b for xi in x]
hist2 = plt.hist(np.random.randn(100000)*np.std(errors), np.arange(-8,8,.5), color='r', density=True, alpha=.5, label='Normal')
hist = plt.hist(errors, np.arange(-8,8,.5), color='k', density=True, alpha=.3, label='True error')
plt.legend(loc='best')
plt.xlabel('Estimate error')
plt.ylabel('Probability')
plt.yticks(np.arange(0,1.2,.2),np.arange(0,.6,.1))  # relabel density ticks as per-bin probability (bin width = 0.5)


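As an aside (not part of the original analysis), when the residuals are heavy-tailed like this, one alternative to ordinary least squares is a robust fit that down-weights large residuals. A minimal sketch using statsmodels' rlm (Huber weighting by default) on the x and y simulated above; df_rob and rlm_fit are new names introduced here for illustration:

In [ ]:
# Robust (Huber-weighted) fit: large residuals are down-weighted, so the slope
# estimate is pulled less by the heavy right tail of the simulated noise
df_rob = pd.DataFrame({'x': x, 'y': y})
rlm_fit = smf.rlm('y ~ x', data=df_rob).fit()
print('Robust estimate: y =', rlm_fit.params['x'], '*x +', rlm_fit.params['Intercept'])
print('Least squares:   y =', m, '*x +', b)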

1c. Importance of independence of samples

Linear regression works well only if samples are independent and identically distributed (IID). If this assumption is violated, the computed correlation statistics are not reliable. Below, 50 independent burrito reviews are combined with repeated ratings of a single cheap, highly rated burrito; the repeated (non-independent) ratings manufacture a correlation between cost and rating that is not present in the independent data.


In [10]:
# Simulate 50 independent burritos (cost and star rating)
np.random.seed(0)
burrito1_cost = 6 + np.random.randn(50)
burrito1_stars = 3.5 + np.random.randn(50)*.8

# A single new $4 burrito rated many times (non-independent samples)
burrito_new_cost = 4
burrito_new_stars = np.arange(4,5.1,.1)

# Define cost and stars arrays
c = np.append(np.ones(len(burrito_new_stars))*burrito_new_cost,burrito1_cost)
s = np.append(burrito_new_stars,burrito1_stars)

# Correlation for the 50 independent data points alone
print('Statistics of independent data points')
print('Correlation coefficient (r) =', np.round(sp.stats.pearsonr(burrito1_cost, burrito1_stars)[0],2))
print('p =', np.round(sp.stats.pearsonr(burrito1_cost, burrito1_stars)[1],3))

# Correlation after appending the repeated (non-independent) ratings
print('\nStatistics after adding in the non-independent data points')
print('R squared (fraction of variance explained) =', np.round(sp.stats.pearsonr(c,s)[0]**2,2))
print('p =', np.round(sp.stats.pearsonr(c,s)[1],3))


Statistics of independent data points
Correlation coefficient (r) = -0.06
p = 0.683

Statistics after adding in the non-independent data points
R squared (fraction of variance explained) = 0.13
p = 0.004

In [11]:
# Fit line to data
A = np.vstack([c, np.ones(len(c))]).T
m, b = np.linalg.lstsq(A, s)[0]

# Plot fitted line
plt.figure(figsize=(4,4))
plt.plot(c, s, 'k.', ms=8)
plt.plot([0,10], [m*x+b for x in [0,10]], 'k--')
plt.xlabel('Burrito cost')
plt.ylabel('Stars')
plt.xlim((0,10))


Out[11]:
(0, 10)
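
A minimal sketch of one possible remedy (an addition, not from the original analysis): collapse the repeated ratings of the single new burrito to their mean, so that every burrito contributes exactly one (cost, stars) sample before correlating. The names c_indep and s_indep are introduced here for illustration:

In [ ]:
# Collapse the repeated ratings of the single new burrito to one mean point,
# so every burrito contributes exactly one (cost, stars) sample
c_indep = np.append(burrito_new_cost, burrito1_cost)
s_indep = np.append(burrito_new_stars.mean(), burrito1_stars)
print('r =', np.round(sp.stats.pearsonr(c_indep, s_indep)[0], 2),
      ', p =', np.round(sp.stats.pearsonr(c_indep, s_indep)[1], 3))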

2. Multiple linear regression


In [12]:
# Load burrito data into pandas dataframe
url = 'https://docs.google.com/spreadsheet/ccc?key=18HkrklYz1bKpDLeL-kaMrGjAhUM6LeJMIACwEljCgaw&output=csv'
df = pd.read_csv(url)

# Delete unreliable ratings
import pandasql
df.Unreliable = df.Unreliable.map({'x':1,'X':1,1:1})
df.Unreliable = df.Unreliable.fillna(0)
q = """SELECT * FROM df WHERE unreliable == 0"""
df = pandasql.sqldf(q.lower(), locals())

# Rename the Meat:filling column; the colon would be read as an interaction term in a statsmodels formula
df.rename(columns={'Meat:filling': 'Meatratio'}, inplace=True)

# Limit data to main features
df = df[['Location','Burrito','Yelp','Cost','Hunger', 'Volume', 'Tortilla', 'Temp', 'Meat',
         'Fillings', 'Meatratio', 'Uniformity', 'Salsa', 'Synergy', 'Wrap', 'overall']]
df.tail()


Out[12]:
Location Burrito Yelp Cost Hunger Volume Tortilla Temp Meat Fillings Meatratio Uniformity Salsa Synergy Wrap overall
306 Taco Villa Carnitas NaN 5.99 4.0 0.59 4.0 4.0 4.0 3.0 4.0 3.5 2.5 3.5 5.0 3.5
307 Lupe's Taco Shop TGunz NaN 10.00 3.0 1.54 3.0 4.0 3.2 3.5 1.8 1.8 3.5 3.7 3.5 3.9
308 Lupe's Taco Shop TGunz NaN 10.00 4.2 1.54 3.0 4.5 3.5 3.5 2.5 2.0 3.0 3.5 4.5 3.5
309 Filiberto's Adobada 3.0 6.25 4.4 NaN 4.2 4.5 4.0 3.0 5.0 3.8 2.5 3.0 5.0 3.7
310 Filiberto's California NaN 6.25 3.0 NaN 4.5 4.5 3.7 3.5 4.0 3.5 3.0 4.0 4.5 4.0
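
Incidentally, the pandasql filter above can be written with plain pandas boolean indexing. A sketch of an equivalent line (it would replace the SQL query in the cell above, before the columns are pared down):

In [ ]:
# Drop-in alternative to the pandasql query: keep only reviews not marked unreliable
df = df[df.Unreliable == 0].reset_index(drop=True)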

2a. Individual linear regressions between burrito dimensions and overall satisfaction rating (BAD)

  • Ignores redundant information shared across features
  • Suffers from the multiple comparisons problem (a correction sketch follows the output below)

In [13]:
# Define dimensions of interest
dims = ['Cost', 'Hunger', 'Tortilla', 'Temp', 'Meat',
        'Fillings', 'Meatratio', 'Uniformity', 'Salsa', 'Wrap']

# Correlate each dimension to the overall satisfaction rating
results = {}
for d in dims:
    model_str = 'overall ~ ' + d
    results[d] = smf.ols(model_str, data=df, missing='drop').fit()
    print(d,', R2 =',results[d].rsquared, ', p =',np.round(results[d].pvalues[d],4))


Cost , R2 = 0.023711482916 , p = 0.0071
Hunger , R2 = 0.0237099905646 , p = 0.0069
Tortilla , R2 = 0.134042259073 , p = 0.0
Temp , R2 = 0.0688476427895 , p = 0.0
Meat , R2 = 0.462036454737 , p = 0.0
Fillings , R2 = 0.52080645318 , p = 0.0
Meatratio , R2 = 0.271881239693 , p = 0.0
Uniformity , R2 = 0.178027143455 , p = 0.0
Salsa , R2 = 0.160192091587 , p = 0.0
Wrap , R2 = 0.0228518029439 , p = 0.008
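
Because the ten p-values above come from ten separate tests, a multiple comparisons correction is warranted before interpreting them. A minimal sketch applying a Bonferroni correction to the results dict built in the cell above:

In [ ]:
# Bonferroni correction: multiply each p-value by the number of tests, cap at 1
n_tests = len(dims)
for d in dims:
    p_corrected = min(results[d].pvalues[d] * n_tests, 1.0)
    print(d, ', corrected p =', np.round(p_corrected, 4))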

In [14]:
plt.plot(df['Fillings'],df['overall'],'k.')
plt.xlabel('Nonmeat filling flavor')
plt.ylabel('overall satisfaction')


Out[14]:
<matplotlib.text.Text at 0x11668af98>

2b. Multiple linear regression


In [15]:
model_str = 'overall ~ ' + ' + '.join(dims)
print(model_str)


overall ~ Cost + Hunger + Tortilla + Temp + Meat + Fillings + Meatratio + Uniformity + Salsa + Wrap

In [16]:
results_all = smf.ols(model_str, data=df, missing='drop').fit()
print(results_all.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                overall   R-squared:                       0.723
Model:                            OLS   Adj. R-squared:                  0.712
Method:                 Least Squares   F-statistic:                     68.48
Date:                Mon, 08 May 2017   Prob (F-statistic):           2.25e-67
Time:                        12:50:15   Log-Likelihood:                -115.21
No. Observations:                 274   AIC:                             252.4
Df Residuals:                     263   BIC:                             292.2
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -0.5042      0.245     -2.055      0.041        -0.987    -0.021
Cost           0.0217      0.023      0.932      0.352        -0.024     0.068
Hunger         0.0200      0.030      0.668      0.505        -0.039     0.079
Tortilla       0.0631      0.033      1.919      0.056        -0.002     0.128
Temp           0.0617      0.024      2.540      0.012         0.014     0.109
Meat           0.2816      0.035      8.085      0.000         0.213     0.350
Fillings       0.3752      0.037     10.044      0.000         0.302     0.449
Meatratio      0.1389      0.027      5.062      0.000         0.085     0.193
Uniformity     0.0749      0.025      3.023      0.003         0.026     0.124
Salsa          0.0597      0.027      2.174      0.031         0.006     0.114
Wrap           0.0349      0.022      1.619      0.107        -0.008     0.077
==============================================================================
Omnibus:                       51.954   Durbin-Watson:                   1.631
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              158.345
Skew:                          -0.808   Prob(JB):                     4.13e-35
Kurtosis:                       6.356   Cond. No.                         139.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Add 'flavor synergy' to the model


In [17]:
dims = ['Cost','Hunger', 'Tortilla', 'Temp', 'Meat',
        'Fillings', 'Meatratio', 'Uniformity', 'Salsa', 'Synergy', 'Wrap']
model_str = 'overall ~ ' + ' + '.join(dims)
results_all = smf.ols(model_str, data=df, missing='drop').fit()
print(results_all.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                overall   R-squared:                       0.781
Model:                            OLS   Adj. R-squared:                  0.772
Method:                 Least Squares   F-statistic:                     84.41
Date:                Mon, 08 May 2017   Prob (F-statistic):           3.58e-79
Time:                        12:50:15   Log-Likelihood:                -82.726
No. Observations:                 272   AIC:                             189.5
Df Residuals:                     260   BIC:                             232.7
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     -0.4903      0.219     -2.237      0.026        -0.922    -0.059
Cost           0.0317      0.021      1.523      0.129        -0.009     0.073
Hunger         0.0187      0.027      0.698      0.486        -0.034     0.071
Tortilla       0.0309      0.030      1.043      0.298        -0.027     0.089
Temp           0.0615      0.022      2.836      0.005         0.019     0.104
Meat           0.2065      0.032      6.375      0.000         0.143     0.270
Fillings       0.2486      0.037      6.789      0.000         0.177     0.321
Meatratio      0.0880      0.025      3.481      0.001         0.038     0.138
Uniformity     0.0701      0.022      3.152      0.002         0.026     0.114
Salsa          0.0126      0.025      0.495      0.621        -0.038     0.063
Synergy        0.2978      0.036      8.334      0.000         0.227     0.368
Wrap           0.0463      0.019      2.398      0.017         0.008     0.084
==============================================================================
Omnibus:                       33.545   Durbin-Watson:                   1.623
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               79.744
Skew:                          -0.584   Prob(JB):                     4.83e-18
Kurtosis:                       5.382   Cond. No.                         144.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
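
If you want the fitted coefficients programmatically rather than reading them off the summary table, the results object exposes them directly. A short sketch using the results_all fit from the cell above:

In [ ]:
# Fitted coefficients (sorted) and their 95% confidence intervals,
# pulled directly from the fitted results object
print(results_all.params.sort_values(ascending=False))
print(results_all.conf_int())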

2c. Correlation matrix


In [18]:
dfcorr = df[dims].corr()
M = len(dims)
from matplotlib import cm

clim1 = (-1,1)
plt.figure(figsize=(12,10))
cax = plt.pcolor(range(M+1), range(M+1), dfcorr, cmap=cm.bwr)
cbar = plt.colorbar(cax, ticks=(-1,-.5,0,.5,1))
cbar.ax.set_ylabel('Pearson correlation (r)', size=30)
plt.clim(clim1)
cbar.ax.set_yticklabels((-1,-.5,0,.5,1),size=20)
ax = plt.gca()
ax.set_yticks(np.arange(M)+.5)
ax.set_yticklabels(dims,size=25)
ax.set_xticks(np.arange(M)+.5)
ax.set_xticklabels(dims,size=25)
plt.xticks(rotation='vertical')
plt.tight_layout()
plt.xlim((0,M))
plt.ylim((0,M))


Out[18]:
(0, 11)
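
To read the matrix numerically rather than by eye, you can list the most strongly correlated pairs of dimensions. A minimal sketch using the dfcorr computed above (the name upper is introduced here for illustration):

In [ ]:
# List the most strongly correlated pairs of dimensions (upper triangle only,
# so each pair appears once and the diagonal is excluded)
upper = dfcorr.where(np.triu(np.ones(dfcorr.shape, dtype=bool), k=1))
print(upper.stack().sort_values(ascending=False).head(5))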

3. Regressing out

If you want to know whether a feature contains significantly more information about the output than what is already contained in another feature, you can regress the latter out of the former.

3a. Manufacture input features with redundant information about the output variable


In [19]:
# y is the output; x1 is a noisy measurement of y; x2 is a noisier copy of x1,
# so x2 is related to y only through x1
y = np.arange(1,2,.01)
x1 = y + np.random.randn(len(y))*.1
x2 = x1 + np.random.randn(len(y))*.3

In [20]:
plt.figure(figsize=(8,4))
plt.subplot(1,2,1)
plt.plot(x1,y,'k.')
plt.ylabel('y')
plt.xlabel('x1')
plt.subplot(1,2,2)
plt.plot(x2,y,'k.')
plt.xlabel('x2')

plt.figure(figsize=(4,4))
plt.plot(x1,x2,'k.')
plt.xlabel('x1')
plt.ylabel('x2')


Out[20]:
<matplotlib.text.Text at 0x11c93e470>

In [21]:
print('Correlation coefficient between x1 and y: ', np.round(sp.stats.pearsonr(x1,y)[0],3))
print('Correlation coefficient between x2 and y: ', np.round(sp.stats.pearsonr(x2,y)[0],3))
print('Correlation coefficient between x1 and x2: ', np.round(sp.stats.pearsonr(x1,x2)[0],3))


Correlation coefficient between x1 and y:  0.936
Correlation coefficient between x2 and y:  0.693
Correlation coefficient between x1 and x2:  0.749

3b. Regress out x1 from x2


In [22]:
# Regress out features
def regress_out(x, y):
    """Regress x out of y to get a new y value"""
    A = np.vstack([x, np.ones(len(x))]).T
    m, b = np.linalg.lstsq(A, y)[0]
    return y - b - x*m

x2b = regress_out(x1, x2)
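
Note that regress_out is simply taking the residuals of an ordinary least-squares fit of x2 on x1; a quick sketch verifying the equivalence with statsmodels:

In [ ]:
# regress_out is equivalent to taking the residuals of an OLS fit of x2 on x1
ols_resid = smf.ols('x2 ~ x1', data=pd.DataFrame({'x1': x1, 'x2': x2})).fit().resid
print(np.allclose(ols_resid, x2b))  # should print True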

In [23]:
# Visualize relationships with x2 after regressing out x1
plt.figure(figsize=(4,7))
plt.subplot(2,1,1)
plt.plot(x2b,x1,'k.')
plt.ylabel('x1')
plt.subplot(2,1,2)
plt.plot(x2b,y,'k.')
plt.ylabel('y')
plt.xlabel('x2 after regressing out x1')


Out[23]:
<matplotlib.text.Text at 0x11cae1198>

In [24]:
print('After regressing out x1 from x2:')
print('Correlation coefficient between x2 and y: ', np.round(sp.stats.pearsonr(x2b,y)[0],3))
print('Correlation coefficient between x1 and x2: ', np.round(sp.stats.pearsonr(x1,x2b)[0],3))


After regressing out x1 from x2:
Correlation coefficient between x2 and y:  -0.013
Correlation coefficient between x1 and x2:  -0.0

3c. Multiple linear regression of x1 and x2 to predict y


In [25]:
df = pd.DataFrame.from_dict({'x1':x1, 'x2':x2, 'y':y})
results_all = smf.ols('y ~ x1 + x2', data=df).fit()
print(results_all.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.877
Model:                            OLS   Adj. R-squared:                  0.874
Method:                 Least Squares   F-statistic:                     344.3
Date:                Mon, 08 May 2017   Prob (F-statistic):           8.70e-45
Time:                        12:50:17   Log-Likelihood:                 86.947
No. Observations:                 100   AIC:                            -167.9
Df Residuals:                      97   BIC:                            -160.1
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept      0.0937      0.055      1.712      0.090        -0.015     0.202
x1             0.9449      0.054     17.636      0.000         0.839     1.051
x2            -0.0128      0.036     -0.353      0.725        -0.085     0.059
==============================================================================
Omnibus:                        3.020   Durbin-Watson:                   1.758
Prob(Omnibus):                  0.221   Jarque-Bera (JB):                2.313
Skew:                          -0.225   Prob(JB):                        0.315
Kurtosis:                       2.406   Cond. No.                         17.0
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
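
The near-zero coefficient on x2 here mirrors section 3b: once x1 is in the model, x2 adds essentially no information about y. A short sketch making the comparison explicit, using the variables defined above:

In [ ]:
# The x2 coefficient in the multiple regression echoes the regressed-out
# correlation from section 3b: x2 adds essentially nothing once x1 is known
print('x2 coefficient in the multiple regression:', np.round(results_all.params['x2'], 3))
print('corr(y, x2 after regressing out x1):      ', np.round(sp.stats.pearsonr(x2b, y)[0], 3))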
